Appendix E — Assignment E

Multicollinearity; Qualitative & Quantitative predictors

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Make R code chunks to insert code and type your answer outside the code chunks. Ensure that the solution is written neatly enough to understand and grade.

  3. Render the file as HTML to submit. For theoretical questions, you can either type the answer and include the solutions in this file, or write the solution on paper, scan and submit separately.

  4. The assignment is worth 100 points, and is due on 12th November 2023 at 11:59 pm.

  5. Five points are properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (the theory part may be scanned and submitted separately) (2 pts).
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.). There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)
  • Final answers of each question are written clearly (1 pt).
  • The proofs are legible, and clearly written with reasoning provided for every step. They are easy to follow and understand (1 pt)

E.1 Prostate

A study was conducted on 97 men with prostate cancer who were due to receive a radical prostatectomy. The dataset prostate.csv contains data on 9 measurements made on these 97 men. The description of variables can be found here.

E.1.1 Fitting the model

Fit a linear regression model with lpsa as the response and the other variables as the predictors. Write down the equation to predict lpsa based on the other eight variables. When coding the model in R, do not type the names of all the predictors.

(2 points)

E.1.2 Model significance

Is the overall regression significant at 5% level? Justify your answer.

(2 points)

E.1.3 Interpretation

Interpret the coefficient of svi.

(3 points)

E.1.4 Effect of other predictors

Fit a simple linear regression on lpsa against gleason. Is the predictor gleason statistically significant in this model?

Was gleason statistically significant in the model developed in the first question (E.1.1) with multiple predictors?

Did the statistical significance of gleason change in the absence of other predictors? Why or why not?

(1 + 1 + 1 + 2 = 5 points)

E.1.5 Relevant subset of predictors

Find the largest subset of predictors in the model developed in the first question (E.1.1), such that their coefficients are zero, i.e., none of the predictors in the subset are statistically significant. Assume a significance level of 5%.

Does the model \(R^2_{adj}\) change a lot if you remove the set of predictors identified above from the model in the first question? Why or why not?

(4 + 2 = 6 points)

E.2 Infant mortality

The dataset infmort.csv gives the infant mortality of different countries in the world. The column mortality contains the infant mortality in deaths per 1000 births.

E.2.1 Data visualisation

Continue with the model developed in C.2.6. Visualize the distribution of the transformed response in C.2.6 for each region, i.e., each continent. Based on the plots, does region seem to be associated with infant mortality.

Note: You may use the R function boxplot() to visualize the distribution.

(2 + 1 = 3 points)

E.2.2 MLR

Update the model developed in in C.2.6 by adding region as a predictor. Use this model to compute adjusted_mortality for each observation in the data, where adjusted mortality is the mortality after removing the estimated effect of income. Make a boxplot of log(adjusted_mortality) against region.

(2 + 1 = 3 points)

E.2.3 Analysis

From the plot in the previous question (E.2.2):

  1. Does Europe still seem to have the lowest mortality as compared to other regions after removing the effect of income from mortality?

  2. After adjusting for income, is there any change in the mortality comparison among different regions. Compare the plot developed in the previous question (E.2.2) to the plot of log(mortality) against region developed earlier (E.2.1) to answer this question.

Hint: Do any African / Asian / American countries seem to do better than all the European countries with regard to mortality after adjusting for income?

(1 + 3 = 4 points)

E.3 GDP per capita

The dataset soc_ind.csv contains the GDP per capita of some countries along with several social indicators.

E.3.1 Best predictor

For a simple linear regression model predicting gdpPerCapita, justify that lifeFemale (female life expectancy) will provide the best model fit (ignore categorical predictors)?

(1 point)

E.3.2 Polynomial transformation

Develop a linear regression model to predict gdpPerCapita based on the appropriate polynomial transformation of lifeFemale. Find the appropriate polynomial transformation using the General Linear test approach. Support your choice of the appropriate polynomial transformation with a plot of the model on the data, showing the model fit.

(3 + 2 = 5 points)

E.3.3 VIF

Is there multicollinearity in the model? Find the VIF and comment.

(2 points)

E.3.4 Centering

Center the predictor lifeFemale and re-develop the model. Did the severity of multicollinearity reduce in the updated model? If yes, what is the benefit of the reduced multicollinearity. Use the model summary of both the models as an evidence to support your arguments.

(2 + 2 + 2 + 2 = 8 points)

E.3.5 Interpretation

Develop a model to predict gdpPerCapita with lifeFemale and continent as predictors.

  1. Interpret the intercept term.

  2. For a given value of lifeFemale, are there any continents that do not have a significant difference between their mean gdpPerCapita and that of Africa? If yes, then which ones, and why? If no, then why not? Consider a significance level of 5%.

(2 + 2 = 4 points)

E.3.6 Interaction

The model developed in the previous question has a limitation. It assumes that the increase in mean gdpPerCapita with a unit increase in lifeFemale does not depend on the continent.

  1. Eliminate this limitation by including interaction of continent with lifeFemale in the model developed in the previous question (E.3.5). Print the model summary of the model with interactions.

  2. Interpret the coefficient of any one of the interaction terms.

(1 + 2 = 3 points)

Note: Use the model developed in this question (E.3.6) for all the questions below.

E.3.7 Model visualisation

Use the model developed in (E.3.6) to plot regression lines for Africa, Asia, and Europe. Put gdpPerCapita on the vertical axis and lifeFemale on the horizontal axis. Use a legend to distinguish among the regression lines of the three continents.

(3 points)

E.3.8 Analysing the model

Based on the plot develop in the previous question (E.3.7), which continent has the highest increase in mean gdpPerCapita for a unit increase in lifeFemale, and which one has the least? Justify your answer.

(1 + 1 + 2 = 4 points)

E.3.9 Quantifying difference in slope

On an average, how much higher is the increase in GDP per Capita of a European country than increase in GDP per Capita of an Asian country for an increase of 1 year in female life expectancy?

(2 points)

E.3.10 Quantifying uncertainty of difference in slope

Find the 95% confidence interval of the metric obtained in the previous question.

In other words, find the 95% confidence interval of the difference of increase in GDP per Capita of a European country and increase in GDP per Capita of an Asian country, for an increase of 1 year in female life expectancy.

Note: You may use the R function vcov() to find the variance-covariance matrix of the regression coefficients:

\(Var(\hat{\beta}) = \sigma^2(X^TX)^{-1}\)

(5 points)

E.3.11 Quantifying difference in prediction

On an average, for a female life expectancy of 72 years, how much higher is the GDP per Capita of an Asian country than the GDP per Capita of a European country?

(5 points)

E.3.12 Uncertainty of difference in prediction

Find the 95% confidence interval of the metric obtained in the previous question.

In other words, find the 95% confidence interval of the difference of GDP per Capita of an Asian country and the GDP per Capita of a European country, for a female life expectancy of 72 years.

(10 points)

Note: You may use the R function vcov() to find the variance-covariance matrix of the regression coefficients:

\(Var(\hat{\beta}) = \sigma^2(X^TX)^{-1}\)

E.3.13 Statistical significance

Based on the confidence interval obtained in the question (E.3.12), is the GDP per Capita of an Asian country significantly different from the GDP per Capita of a European country, for a female life expectancy of 72 years?

(1 point)

E.3.14 Modeling approach

The developed model (in E.3.6) can be used to predict the GDP per Capita for any country, based on its female life expectancy and continent.

Instead of developing a single model, consider the scenario that a distinct model is developed for each continent, i.e., a distinct model is developed for Asia based on the data only corresponding to Asia, a distinct model is developed for Europe based on the data only corresponding to Europe, and so on. Now answer the following questions with justification for each:

E.3.14.1

Will the expected GDP per capita of a country for a given female life expectancy from the model developed in E.3.6 be different from the model developed only based on the country’s continent data?

E.3.14.2

Will the confidence interval of the expected GDP per capita of a country for a given female life expectancy from the model developed in E.3.6 be different from the model developed only based on the country’s continent data?

E.3.14.3

What is the advantage of one approach over the other, and vice-versa? In other words, what is the advantage of the approach of developing a single model for all continents (as in E.3.6) in contrast to developing a separate model for each continent. And, what is the advantage of the approach of developing a separate model for each continent in contrast to developing a single model for all continents.

(4 + 4 + 6 = 14 points)